Based on Giulia Carella's Notebook presented in:
SDSC20 Workshop - Integrating CARTOframes into Spatial Data Science workflows. October 21 2020
In this practice we will play the role of a Spatial Data Scientist in a geo-planning department. We will lead a site-selection project for our company, Starbucks. Site selection refers to the process of deciding where to open a new store/facility, or relocate an existing one, by comparing the merits of potential locations. In previous practices, we learnt the concept of potential as the total amount of expense/revenue that an entity (e.g. a company, shop, etc.) can generate. If some attribute of interest is known (e.g. the revenues of existing stores), statistical and machine learning methods can be used to help with the decision process.
In this practice, we will support our company by looking at where Starbucks should open new coffee shops in Long Island, NY. Specifically, we will build a model for the revenues of the existing stores as a function of sociodemographic data and use this model to predict the expected revenue in each block group.
We will:
First of all, install virtualenv inside your project folder to isolate your project environment and avoid conflicts with your other packages.
pip install virtualenv
After installing it, run the following commands one by one inside your root project directory:
virtualenv venv
source venv/bin/activate
Now your environment is isolated and you can install the required packages inside it.
pip install -r requirements.txt
(In case of an error when installing the Rtree package, please use the following command in your directory: pip install "rtree>=0.8,<0.9")
To exit the virtual environment type:
deactivate
from requirements import *
from ppca import PPCA
from utils import *
Log in to your CARTO account (www.carto.com), open your CARTO dashboard (with your login and password) and click on API KEY to get your key.
carto_username = '*'
carto_API = '*'
set_default_credentials(username=carto_username, api_key=carto_API)
First we load the Starbucks data stored locally in the starbucks_long_island.csv file. The data contains the addresses of the Starbucks stores, their annual revenue ($) and some store characteristics (e.g. the number of open hours per week).
stores = pd.read_csv('./data/starbucks_long_island.csv')
stores.head(3)
[EX 1] Explore the dataset to identify the number of records (samples), the variables, the nulls and the type of each variable
stores.info(memory_usage=False)
print('\n' * 2)
print('The dataset has 177 entries with 11 different features each.')
print('All the variables are null-free.')
print('All the variables except Name, Addresslines and StoreNumber are recognised as numbers (either int or float).')
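The same facts that `info()` prints can also be computed programmatically, which is handy when the checks need to be repeated. A minimal sketch on a toy dataframe (the column names are illustrative, not the real schema):

```python
import pandas as pd

# Toy dataframe standing in for the Starbucks data (illustrative columns only)
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'revenue': [500000.0, 750000.0, 620000.0],
    'hours_open_per_week': [98, 105, 91],
})

n_rows, n_cols = toy.shape
total_nulls = int(toy.isnull().sum().sum())
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(n_rows, n_cols)   # 3 3
print(total_nulls)      # 0
print(numeric_cols)     # ['revenue', 'hours_open_per_week']
```

Running the same three lines on `stores` reproduces the counts reported above.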
[EX 2] Plot the distribution and the boxplot of the annual revenues by store
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.distplot(stores['revenue'], color="lightblue", ax=axes[0])
sns.boxplot(data=stores, y= 'revenue', orient='v', ax=axes[1])
plt.show()
[EX 3] Which other exploratory analysis do you propose to have a clear understanding of the data?
list_binary_varibables = []
for col in stores.columns:
    if len(stores[col].unique()) == 2:
        list_binary_varibables.append(col)
list_binary_varibables
#for variables with only two values, we plot countplots
for var in list_binary_varibables:
    ax = sns.countplot(x=var, data=stores)
    plt.show()
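The binary-column scan can be written more compactly with pandas' `nunique`; a toy sketch (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'drive_through': [0, 1, 1, 0],     # binary: only two distinct values
    'revenue': [1.0, 2.0, 3.0, 4.0],   # continuous
})
binary_cols = [c for c in df.columns if df[c].nunique() == 2]
print(binary_cols)  # ['drive_through']
```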
#for the other numerical variables, we plot a histogram and a boxplot
for var in ['hours_open_per_week', 'gc_status_rel']:
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    sns.distplot(stores[var], color="lightblue", ax=axes[0])
    sns.boxplot(data=stores, y=var, orient='v', ax=axes[1])
    plt.show()
Since we only know the addresses, we first need to geocode them to extract the corresponding geographical locations, which will be used to assign to each store the Census data of the corresponding block group.
One of the key services that CARTOframes provides is geocoding, available through: from cartoframes.data.services import Geocoding
Tip: You will find Cartoframes' API description in this link: https://carto.com/developers/cartoframes/reference/#heading-Data-Services
[EX 4] Apply CARTOframes' geocoding service (i.e. Geocoding()) to geocode the addresslines variable in our dataframe. Store the output into the stores dataframe.
# stores = Geocoding().geocode(stores, 'addresslines').data  # consumes geocoding quota
stores = gpd.read_file('./data/table_4c455fc710.geojson')  # pre-geocoded copy of the same data
Done! Now that the stores are geocoded, you will notice a new column named geometry has been added. This column stores the geographic location of each store and it’s used to plot each location on the map.
[EX 5] What is the name of the column with the geographic location? What kind of spatial data object is it?
print('The name of the column with the geographic location is \'geometry\' and it is a geometry object, a POINT.')
stores.head(3)
[EX 6] Let's create our first visualization using CARTOframes' Map and Layer classes. Use Map(Layer(dataframe)) to build a map.
Tip: Check out the Visualization guide to learn more about the visualization capabilities inside of CARTOframes.
Map(Layer(stores))
[EX 7] Congrats! You have made your first map visualization. Now, create a new map with the following characteristics (see this link: https://carto.com/developers/cartoframes/reference/#heading-Map for further information about the parameters of Map(Layer())):
Tip: You may use the function size_bins_style from Cartoframes. Introduce help(size_bins_style) or help(size_bins_legend) in the notebook to get information about these functions
plot_size = (920, 400)
breaks = [400000,700000, 1000000, 1300000, 1600000]
size_bins_st = size_bins_style('revenue', breaks = breaks)
size_bins_leg = size_bins_legend(title='Annual Revenue', description='STARBUCKS', footer='Size of the blob')
viewport = {'zoom': 10, 'lat':40.646891, 'lng': -73.869378}
Map(Layer(stores, size_bins_st, size_bins_leg), size = plot_size, viewport = viewport,show_info=True)
CARTO gives us the option to upload the data to our repository in CARTO's cloud as follows:
to_carto(stores,'starbucks_long_island_geocoded', if_exists = 'replace')
Next, we will download from the CARTO Data Observatory (www.carto.com/data) the demographic and socioeconomic variables that we will use to build a model for the stores' revenues. We will use data from the American Community Survey (ACS), which is an ongoing survey by the U.S. Census Bureau. The ACS is available at its most granular resolution at the census block group level, with the most recent geography boundaries available for 2015. The data that we will use are from the survey run from 2013 to 2017.
As we are interested only in the data for the Long Island area, we first download the geographical boundaries of this area from the manhattan_long_island.geojson file, which is stored locally.
import geopandas as gpd
aoi = gpd.read_file('./data/manhattan_long_island.geojson')
If we display aoi.head() we obtain:
[EX 8] Explore the geometry feature. What kind of spatial object is it?
print('The spatial object is a polygon.')
aoi.info()
aoi.head()
[EX 9] Create a map layer to represent the Long Island area with the following characteristics:
viewport = {'zoom': 7.5, 'lat':40.845419, 'lng': -72.841376}
plot_size = (920, 400)
basicstyle = cartoframes.viz.basic_style(color = '#eaff00', stroke_color = '#eaff00', stroke_width = 2, opacity = 0.2)
Map(Layer(aoi, basicstyle), size = plot_size, viewport = viewport,show_info=True)
More documentation here: https://carto.com/developers/cartoframes/guides/Data-discovery/
Let's start by exploring the Data Catalog to see what data categories CARTO can offer for data in the US.
Catalog().country('usa').categories
Then, for demographics, let's check the available providers and geometries and select the most recent ACS data at the block group level
Catalog().country('usa').category('demographics').datasets.to_dataframe().provider_id.unique()
Let's select usa_acs provider from the previous ones and check the catalogue of enriched data available.
catalog_usa_demo = Catalog().country('usa').category('demographics').datasets.to_dataframe()
catalog_usa_demo_acs = catalog_usa_demo[catalog_usa_demo.provider_id == 'usa_acs']
catalog_usa_demo_acs.geography_name.unique()
Now, let's select the data from 'Census Block Group - United States of America (2015)'
catalog_usa_demo_acs_bk = catalog_usa_demo_acs[catalog_usa_demo_acs.geography_name == 'Census Block Group - United States of America (2015)']
As this Census Block Group geography can have different associated tables, we should list them and select the most recent one. So if we list all tables:
catalog_usa_demo_acs_bk.id.to_list()
To explore the metadata associated with this table in greater detail, we can use the Dataset command from CARTO as follows:
Dataset.get('carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017').to_dict()
[EX10] According to the data description, what are the main characteristics of the census block data in terms of time aggregation, private or public access, update periodicity, etc.?
print('This census block data has a temporal aggregation of 5 years, from 2013 to 2018 (not included). The data is public. ')
If we do the same exploration for the geometries associated with the selected census block, we use the Geography method from CARTO as follows:
Geography.get('carto-do-public-data.carto.geography_usa_blockgroup_2015').to_dict()
Finally, we will build the query to get the Census data and Census geometries of the polygons that intersect with the Long Island geometry previously defined. To do it, we apply the following code:
df_X = Dataset.get('acs_sociodemogr_b758e778')
q = "select * from $dataset$ where ST_Intersects(geom,st_geogfromtext('{}'))".format(aoi.geometry.astype(str)[0])
df_X = df_X.to_dataframe(sql_query=q)
df_X['geometry'] = df_X['geom'].apply(lambda x: str_to_geom(x))
df_X = GeoDataFrame(df_X,geometry = df_X.geometry)
df_X.crs = {'init' :'epsg:4326'}
df_X.to_crs(epsg=4326, inplace=True)
df_X.head(3)
[EX11] Plot a map with the df_X resulting from the intersection of the Census Blocks and the Long Island geometry.
Map(Layer(df_X))
[EX12] Filter the first 1000 census blocks in df_X and build the same map as before. What's the difference with respect to EX11? Why?
print('Now only 1000 census blocks are painted, as we are not taking the whole dataframe.')
Map(Layer(df_X[:1000]))
When comparing data for irregular units like census block groups, extra care is needed with extensive variables, i.e. those whose value for a block can be viewed as a sum of sub-block values, as in the case of population. Extensive variables first need to be normalized, e.g. by dividing by the total area or the total population, depending on the application. Using the metadata available in the CARTO Data Observatory we will check which variables can be considered extensive by looking at their default aggregation method: if the aggregation method is classified as SUM, then the variable is normalized by dividing by the total population.
## Get the default aggregation methods
agg_methods_table = read_carto("""
SELECT column_name,
agg_method
FROM variables_public
WHERE dataset_id = 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017'""",
credentials = Credentials(username='do-metadata', api_key='default_public'))
agg_methods_table.dropna(inplace = True)
agg_methods_table.head()
[EX13] For the census block dataset, df_X, compute densities (i.e. the column value divided by the total population) for each numerical column that appears in agg_methods_table. Store the new value in a column with the '_dens' suffix and delete the original one.
sum_var = agg_methods_table[agg_methods_table.agg_method == 'SUM']['column_name']
for item in sum_var:
    if item != 'total_pop':
        df_X[item + '_dens'] = df_X[item] / df_X['total_pop']
        del df_X[item]
df_X
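The normalization step of EX13 can be sketched on a toy block-group table (the column names are stand-ins for the SUM-aggregated ACS columns):

```python
import pandas as pd

# Toy block-group table: extensive counts are divided by total_pop
df = pd.DataFrame({
    'total_pop': [100.0, 200.0],
    'female_pop': [50.0, 120.0],
})
for col in ['female_pop']:  # stand-in for the SUM-aggregated columns
    df[col + '_dens'] = df[col] / df['total_pop']
    del df[col]
print(df.columns.tolist())             # ['total_pop', 'female_pop_dens']
print(df['female_pop_dens'].tolist())  # [0.5, 0.6]
```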
df_X.columns.values
The columns of your df_X should look as follows:
[EX14] Delete unused columns with the following characteristics:
Keep only the numerical (float64) columns. Drop the columns whose name contains "do-". total_pop should not be kept (it was only used for normalization). We should keep the variables geoid and geometry.
for col in list(df_X.columns):  # snapshot: don't mutate while iterating
    if col not in ["geoid", "geometry"]:
        if col not in df_X.select_dtypes(include=['float64']).columns or "do-" in col:
            del df_X[col]
df_X.drop(['total_pop'], axis=1, inplace=True)
df_X.head(3)
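The dtype-based filtering in EX14 relies on `select_dtypes` to pick the numerical columns; a self-contained toy version (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'geoid': ['a', 'b'],               # keep: identifier
    'do-label': ['x', 'y'],            # drop: contains "do-"
    'female_pop_dens': [0.5, 0.6],     # keep: float64 covariate
})
keep = {'geoid', 'geometry'}
for col in list(df.columns):           # snapshot: don't mutate while iterating
    if col in keep:
        continue
    if col not in df.select_dtypes(include=['float64']).columns or 'do-' in col:
        del df[col]
print(df.columns.tolist())  # ['geoid', 'female_pop_dens']
```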
Let's store in the df_X_cols variable a list of the columns in df_X except geoid and geometry.
df_X_cols = list(df_X.drop(['geoid','geometry'], axis = 1).columns)
Your dataframe should look as follows:
df_X.head()
Missing data are common in surveys such as the ACS. Before using the ACS data as covariates, we therefore need to check whether there are any missing data and decide how to impute the missing values.
[EX15] Identify which variables have missing data and how many samples are missing.
df_X.isnull().sum()
[EX16] If we drop all samples with at least one missing value, what will be the size of the new dataframe? Justify whether it is reasonable to continue with this reduced dataframe or not.
print(df_X.shape)
df_X_reduced = df_X.dropna()
print(df_X_reduced.shape)
print('The size of the new dataframe without the rows with Na values will only have 407 records out from 5897 it had previously. We are losing more than 90% of the data, there is no way we can continue, the results won\'t be significant and can be biased.')
Next, we will check the pair-wise correlations between the ACS variables.
[EX17] Calculate and plot the correlation matrix for 50 variables randomly selected. What do you observe?
def corr_heatmap(data):
    data_corr = data.corr().round(2)
    f, ax = plt.subplots(figsize=(15, 15))
    sns.heatmap(data_corr, vmin=-1, vmax=1, linewidths=.5, ax=ax)
    plt.show()

corr_heatmap(df_X.sample(n=50, axis='columns'))
print('We observe that, in general, the correlation among variables is small. However we can see some exceptions with positive (close to 1.0) and negative (close to -1.0) correlated variables.')
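What the heatmap shows can be reproduced numerically on synthetic data: strongly related columns show correlations near ±1, unrelated ones stay near 0. A sketch (toy columns, not the ACS variables):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),  # strongly correlated with a
    'c': rng.normal(size=200),                     # independent noise
})
corr = df.corr().round(2)
print(corr.loc['a', 'b'] > 0.9)       # True
print(abs(corr.loc['a', 'c']) < 0.3)  # True
```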
To impute the missing data and reduce the model dimensionality we will use a probabilistic formulation of Principal Component Analysis (PCA). PCA is a technique that transforms data sets of high-dimensional vectors into lower-dimensional ones: it finds a smaller-dimensional linear representation of the data such that the original vectors can be reconstructed from the compressed representation with minimum squared error.
In its probabilistic formulation, probabilistic PCA (PPCA), the complete data are modelled by a generative latent variable model whose parameters are estimated iteratively with the EM algorithm. In PPCA, the data are incomplete because the principal components are not observed and are treated as latent variables; this also allows the method to be extended to the case where missing data are present.
More information can be found at http://www.jmlr.org/papers/volume11/ilin10a/ilin10a.pdf.
PCA can also be described as the maximum likelihood solution of a probabilistic latent variable model, which is known as PPCA:
\begin{equation*} Y_{ij} = \mathbf{P}_i \, \mathbf{Z}_j + m_i + \varepsilon_{ij} \quad i = 1, .., d; \, j = 1, .., n \end{equation*}

with
\begin{align} p(\mathbf{Z}_j) \sim N(0, \mathbb{1}) \\ p(\varepsilon_{ij}) \sim N(0, \nu) \end{align}

Both the principal components $\mathbf{Z}$ and the noise $\varepsilon$ are assumed normally distributed. The model can be identified by finding the maximum likelihood (ML) estimate for the model parameters using the Expectation-Maximization (EM) algorithm, minimizing the mean-square error of the observed part of the data. EM is a general framework for learning parameters with incomplete data which iteratively updates the expected complete-data log-likelihood and the maximum likelihood estimates of the parameters. In PPCA, the data are incomplete because the principal components, $\mathbf{Z}_j$, are not observed and are treated as latent variables. When missing data are present, in the E-step the expectation of the complete-data log-likelihood is taken with respect to the conditional distribution of the latent variables given the observed variables. In this case, the EM update rules are the following.
E-step: \begin{align} \mathbf{\Sigma}_{\mathbf{Z}_j} = \nu \left(\nu \, \mathbb{1} + \sum_i \mathbf{P}_i \mathbf{P}_i^T \right)^{-1} \\ \overline{\mathbf{Z}}_j = \dfrac{1}{\nu}\mathbf{\Sigma}_{\mathbf{Z}_j} \sum_i \mathbf{P}_i \left(Y_{ij}- m_i \right) \\ m_{i} = \dfrac{1}{n} \sum_j \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j \right) \\ \end{align}
M-step: \begin{align} \mathbf{P}_{i} = \left( \sum_j \overline{\mathbf{Z}}_j \overline{\mathbf{Z}}_j ^T + \mathbf{\Sigma}_{\mathbf{Z}_j} \right)^{-1} \sum_j \overline{\mathbf{Z}}_j \, \left(Y_{ij}- m_{i} \right)\\ \nu = \dfrac{1}{n} \sum_{ij} \left[ \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j - m_i \right)^2 + \mathbf{P}_i^T \, \mathbf{\Sigma}_{\mathbf{Z}_j} \mathbf{P}_i \, \right] \end{align}
where each row of $\mathbf{P}$ and $\overline{\mathbf{Z}}$ is recomputed based only on those columns of $\overline{\mathbf{Z}}$ which contribute to the reconstruction of the observed values in the corresponding row of the data matrix.
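The EM machinery above is what the workshop's run_ppca helper implements. As a sanity check on the underlying idea, classical PCA recovers a low-dimensional latent plane from noisy high-dimensional data generated exactly as in the model $Y = \mathbf{P}\mathbf{Z} + \varepsilon$. This is a sketch with scikit-learn's deterministic PCA, not the PPCA code used here:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 200 points generated from a 2-D latent space embedded in 5-D, plus small noise,
# mirroring the generative model Y = Z P + eps
Z = rng.normal(size=(200, 2))
P = rng.normal(size=(2, 5))
Y = Z @ P + 0.05 * rng.normal(size=(200, 5))

pca = PCA().fit(Y)
var_exp = pca.explained_variance_ratio_
print(var_exp[:2].sum() > 0.95)  # True: two components carry nearly all the variance
```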
To impute the missing values and extract the PC scores (linearly decorrelated), we will apply PPCA to the df_X_cols columns of our dataframe as follows:
X, var_exp = run_ppca(df_X, df_X_cols, min_obs = 0.8)
In the last columns of X we can observe the Principal Component variables:
X.head()
[EX18] Plot the explained variance resulting from the Principal Component Analysis. Tip: You can use the function plot_pc_var(var_exp, textsize, title). Which Principal Component explains most of the variance?
plot_pc_var(var_exp, textsize=15, title='PCA Explained Variance')
print('The Principal Component that explains most of the variance is the first one. The PCs are sorted in decreasing order of explained variance.')
We decide to keep the PCs that together explain up to 80% of the variance (the black dashed line in the plot below), but this choice is somewhat arbitrary and might vary depending on the application.
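Picking the number of components that reaches a cumulative 80% of explained variance can be sketched with numpy (the variance ratios below are hypothetical, not the workshop's actual values):

```python
import numpy as np

# Hypothetical explained-variance ratios
var_exp = np.array([0.45, 0.20, 0.10, 0.08, 0.07, 0.05, 0.05])
cum = np.cumsum(var_exp)
n_keep = int(np.searchsorted(cum, 0.8)) + 1  # smallest k with cum[k-1] >= 0.8
print(n_keep)  # 4
```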
[EX19] To understand the relationship between the transformed variables (the PC scores) and the original variables, we can plot the 4 variables most highly correlated with each PC. Use the function plot_pc_corr(X, df_X_cols) to plot these relationships. Which is the most positively correlated variable with PC0? And the most negatively correlated variable with PC0?
Note: depending on the random seed on your computer, the signs of the correlations might change.
plot_pc_corr(X, df_X_cols)
print('The most positively correlated variable with PC0 is HOUSING UNITS RENTER OCCUPIED DENS with more than 0.75. On the other hand, the most negatively correlated variable is OWNER OCCUPIED HOUSING UNITS DENS with a similar absolute value.')
[EX20] Create 3 maps as follows: one for the first Principal Component, one for the owner_occupied_housing_units_dens variable and one for the housing_units_renter_occupied_dens variable.
# geo-visualization of the first Principal Component
Map([Layer(X,style=color_bins_style('pc_0'), encode_data=False)])
# geo-visualization of `owner_occupied_housing_units_dens` variable
Map([Layer(X,style=color_bins_style('owner_occupied_housing_units_dens'))])
# geo-visualization of `housing_units_renter_occupied_dens`variable
Map([Layer(X,style=color_bins_style('housing_units_renter_occupied_dens'))])
[EX21] Select only the PCs that explain up to 80% of the variance (dimensionality reduction)
index_list = ['geoid', 'geometry', 'pc_0']
cum_var_exp = np.cumsum(var_exp)
for index in range(0, len(cum_var_exp)):
    if cum_var_exp[index] < 0.8:
        index_list.append('pc_' + str(index + 1))

#drop_cols = np.setdiff1d(X.columns, index_list)
for col in list(X.columns):  # snapshot: don't mutate while iterating
    if col not in index_list:
        X.drop([col], axis=1, inplace=True)
X
Your new X dataframe should look as follows:
X
Finally we have to merge the Census Block data (i.e. the X dataframe) with the initial dataset containing the information about the Starbucks shops (i.e. the stores dataframe). To do it, we will do a spatial join based on the geometry attributes.
stores_enriched = gpd.sjoin(stores[['revenue','geometry']], X, how='right', op='intersects').drop('index_left', axis = 1)
stores_enriched
[EX 22] How many samples have a value in revenue? Why do only 177 census blocks have a revenue value?
print(f'There are {stores_enriched.revenue.count()} samples with a value in \'revenue\'.')
print('There are only 177 census blocks with a revenue value because those are the ones with Starbucks stores on them, so we know the revenue of those census blocks but not the others.')
In this section of the project we are almost ready to train a model. Since this is a regression problem, we will use a Linear Regression model as a baseline algorithm.
[EX 23] Build a dataframe df1 with only the census blocks that have a value in the revenue feature.
df1 = stores_enriched.dropna()
df1
Your dataframe should look as follows:
df1
[EX 24] Build the X and y matrices with the following characteristics:
y = df1.revenue/10**5
filter_col = [col for col in df1 if col.startswith('pc_')]
X = df1[filter_col]
[EX 25] Build a Linear Regression model and train it with X and y. Make a prediction and store it in df1 in a new y_hat column.
from sklearn.linear_model import LinearRegression
lin_regr = LinearRegression()
lin_regr.fit(X, y)
df1['y_hat'] = lin_regr.predict(X)
If we execute the function pandas_scatter we can get a first measure of the performance. Your scatter plot should be similar to the following one:
pandas_scatter(df1,'revenue','y_hat', xlab = 'Observed', ylab = 'Predicted', pseudo_R2 = True)
[EX 25] To measure the performance properly, we should split X and y to generate X_train, y_train, X_test and y_test (80% for training and 20% for testing). Which metric is used to evaluate a regression model?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
print('The metric used to evaluate a regression model is the R2 score, which shows how much of the variability in the dependent variable can be explained by the model.')
[EX 26] Train a linear regression model and evaluate the score. Use the function r2_score(y_actual, y_prediction) to calculate the R^2 for y_train and y_test.
from sklearn.linear_model import LinearRegression
Lin_Regr = LinearRegression()
Lin_Regr.fit(X_train, y_train)
y_predicted = Lin_Regr.predict(X_test)
r2_score(y_test, y_predicted)
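The split-train-evaluate loop above can be reproduced end to end on synthetic data, which makes it easy to verify the workflow before pointing it at the real covariates (the coefficients and noise level below are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
# Linear signal with small Gaussian noise
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print(r2_score(y_te, model.predict(X_te)) > 0.9)  # True: the toy data is nearly noise-free
```

Fixing `random_state` makes the split, and hence the score, reproducible; without it the R^2 will vary a little between runs.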
[EX 27] Finally, we can use this model to predict the annual revenue ($) for each block group in Long Island. Make a copy of the stores_enriched dataframe, store it in the df2 variable and execute the prediction based on the previously trained linear regression model. Store the prediction in a new revenue_pred column in the df2 dataframe.
df2 = stores_enriched.copy()
df2['revenue_pred'] = Lin_Regr.predict(df2[filter_col])
df2.head()
[EX 28] Create a map visualization with the following characteristics:
A color-bins layer with df2 and the revenue_pred variable, including a widget to show a histogram of revenue_pred, plus a size-bins layer with stores and the revenue variable (similar to EX7).
plot_size = (920, 300)
breaks = [400000,700000, 1000000, 1300000, 1600000]
size_bins_st = size_bins_style('revenue', breaks = breaks)
size_bins_leg = size_bins_legend(title='Revenue', description='STARBUCKS', footer='Size of the blob')
viewport = {'zoom': 9.4, 'lat':40.673790, 'lng': -73.992685}
Map([Layer(df2,style=color_bins_style('revenue_pred'), title='Predicted Revenue',
widgets = histogram_widget('revenue_pred', title = 'Histogram of revenue_pred')),
Layer( stores, size_bins_st, size_bins_leg, title='Revenue')], size=plot_size, viewport=viewport, show_info=True)
[EX 29] Select 3 areas where to place a new Starbucks shop. Justify your answer.
print('To place a new Starbucks store, we have to take into consideration the closeness of other Starbucks stores and the Predicted Revenue of the census block, to expect higher benefits.')
print('According to these reasons, I would place a Starbucks store in the area of: viewport = {\'zoom\': 9.94, \'lat\':40.858563, \'lng\': -73.596159}, where there are high predicted revenue values and no stores nearby. Moreover, the closest stores are in the top revenue break.')
print('The second one, following the same reasoning, would be placed in the area of: viewport = {\'zoom\': 9.99, \'lat\':40.800242, \'lng\': -73.385038}. A third one could go at the end of the island, or perhaps a better option is within the area of viewport = {\'zoom\': 9.91, \'lat\':40.846007, \'lng\': -72.843293}, since there is a large area without Starbucks stores nearby and the predicted revenue of the area is quite high (dark).')
If we would like to share the new visualization, we can upload it to CARTO with the following script:
to_carto(df2,'starbucks_long_island_predictions', if_exists = 'replace')